SUMMARY


This  document  provides  a  summary  of  work  completed  by  government  researchers  and  SRA 
International  under  the  work  unit  H06K  (5328X02S),  Foreign  Language  Analysis  and  Recognition 
(FLARe).  This  work  was  performed  over  the  period  1  October  2012  to  30  November  2014  under 
contract  FA8650-09-D-6939. 

The  following  tasks  were  completed  on  Automatic  Speech  Recognition  (ASR).  Korean  language 
models  (LMs)  were  developed  to  reduce  the  number  of  Out-of-Vocabulary  (OOV)  words 
encountered  by  the  recognizer.  Levantine  Arabic  and  Farsi  ASR  systems  were  trained  on 
conversational  telephone  speech.  Three  different  methods  were  investigated  for  combining  Pashto 
ASR  systems.  Software  was  developed  for  training  and  evaluating  hybrid  deep  neural  network 
(DNN)  hidden  Markov  model  (HMM)  speech  recognition  systems.  An  English  ASR  system  was 
developed  for  the  International  Workshop  on  Spoken  Language  Translation  (IWSLT)  2013 
evaluation.  Six  different  techniques  were  investigated  for  interpolating  LM  probabilities.  Finally, 
English  and  Italian  ASR  systems  were  developed  for  the  IWSLT  2014  evaluation. 

Improvements  were  made  to  the  Haystack  Multilingual  Multimedia  Information  Extraction  and 
Retrieval  (MMIER)  system  that  was  initially  developed  under  a  prior  work  unit.  Major  additions 
to the user interface include the following: support for uploading multiple files, expansive changes
to  the  media  player,  additional  Machine  Translation  (MT)  capabilities,  and  integration  of 
geolocation  information.  Scripts  were  developed  for  translating  web  pages  and  displaying  the 
results  in  the  same  format  as  the  input.  Research  into  HTML5  was  initiated  to  improve 
functionality  across  different  operating  systems.  The  processing  pipeline  was  updated  to  provide 
support  for  decoding  hybrid  DNN-HMM  systems,  support  for  N-gram  and  Recurrent  Neural 
Network  (RNN)  LM  rescoring,  and  improved  text  extraction  from  Portable  Document  Format 
(PDF) files. Japanese, Chinese, and Pashto speech recognition systems were developed and then
incorporated into Haystack.

1.0 INTRODUCTION


This  document  provides  a  summary  of  work  completed  by  government  researchers  and  SRA 
International  under  the  work  unit  5328X02S,  Foreign  Language  Analysis  and  Recognition 
(FLARe).  This  work  was  performed  over  the  period  1  October  2012  to  30  November  2014  under 
contract  FA8650-09-D-6939. 

The  following  tasks  were  completed  on  automatic  speech  recognition  (ASR).  Korean  language 
models  (LMs)  were  developed  to  reduce  the  number  of  out-of-vocabulary  (OOV)  words 
encountered  by  the  recognizer.  Levantine  Arabic  and  Farsi  ASR  systems  were  trained  on 
conversational  telephone  speech.  Three  different  methods  were  investigated  for  combining 
Pashto  ASR  systems.  Software  was  developed  for  training  and  evaluating  hybrid  deep  neural 
network  (DNN)  hidden  Markov  model  (HMM)  speech  recognition  systems.  An  English  ASR 
system was developed for the International Workshop on Spoken Language Translation
(IWSLT)  2013  evaluation.  Six  different  techniques  were  investigated  for  interpolating  LM 
probabilities.  Finally,  English  and  Italian  ASR  systems  were  developed  for  the  IWSLT  2014 
evaluation. 

Improvements  were  made  to  the  Haystack  multilingual  multimedia  information  extraction  and 
retrieval  (MMIER)  system  that  was  initially  developed  under  a  prior  work  unit.  Major  additions 
to the user interface include the following: support for uploading multiple files, expansive
changes  to  the  media  player,  additional  machine  translation  (MT)  capabilities,  and  integration  of 
geolocation  information.  Scripts  were  developed  for  translating  web  pages  and  displaying  the 
results  in  the  same  format  as  the  input.  Research  into  HTML5  was  initiated  to  improve 
functionality  across  different  operating  systems.  The  processing  pipeline  was  updated  to  provide 
support  for  decoding  hybrid  DNN-HMM  systems,  support  for  N-gram  and  recurrent  neural 
network  (RNN)  LM  rescoring,  and  improved  text  extraction  from  portable  document  format 
(PDF) files. Japanese, Chinese, and Pashto speech recognition systems were developed and then
incorporated  into  Haystack. 

This  report  is  organized  as  follows.  Section  2.0  describes  the  experiments  and  accomplishments. 
Section 3.0 summarizes conclusions drawn from the experiments.

2.0 EXPERIMENTS AND ACCOMPLISHMENTS


This  section  discusses  the  experiments  and  accomplishments  for  the  covered  period.  Section 
2.1  discusses  the  ASR  experiments  that  were  performed,  and  Section  2.2  describes  the 
improvements  made  to  the  Haystack  MMIER  system. 

2.1 ASR Experiments

This  section  discusses  the  ASR  experiments  that  were  conducted.  Section  2.1.1  describes  how 
Korean  ASR  systems  were  designed  to  reduce  the  effects  of  OOV  words.  Section  2.1.2  presents 
the  Levantine  Arabic  and  Farsi  ASR  systems  that  were  developed  on  conversational  telephone 
speech.  Section  2.1.3  describes  three  methods  that  were  investigated  for  combining  Pashto  ASR 
systems.  Section  2.1.4  discusses  software  that  was  developed  for  training  and  evaluating  hybrid 
DNN-HMM  speech  recognition  systems.  Section  2.1.5  presents  the  English  ASR  system  that 
was  developed  for  the  IWSLT  2013  evaluation  campaign.  Section  2.1.6  describes  several 
methods  that  were  investigated  for  performing  LM  interpolation.  Finally,  Section  2.1.7  describes 
the  English  and  Italian  ASR  systems  that  were  developed  for  IWSLT  2014. 

2.1.1.  Morfessor  for  Korean  ASR 

Korean  ASR  systems  were  designed  to  reduce  the  effects  of  OOV  words  encountered  by  the 
recognizer.  OOV  words  are  those  words  spoken  by  a  person  that  are  not  in  the  pronunciation 
dictionary  and  LM  for  an  ASR  system;  as  a  result,  they  will  never  appear  in  the  output  of  the 
recognizer,  thereby  increasing  the  error  rate.  To  reduce  the  number  of  OOV  words,  Korean  LMs 
were  estimated  using  both  words  and  sub-word  units  that  can  be  combined  to  form  words. 

Korean  sub-word  units  were  automatically  derived  using  Morfessor  [1]  with  the  baseline 
algorithm  and  the  categories-MAP  algorithm  with  perplexity  thresholds  of  10,  50,  100,  and  400. 
The  following  procedure  was  used  to  incorporate  these  sub-word  units  into  the  recognizer: 

•  Evaluate  Morfessor  on  the  text  corpus 

•  Create  a  pronunciation  dictionary  by  applying  letter-to-sound  rules 

•  Train an LM on the sub-word units, and attach a + sign to the start of every sub-word
unit except for the first sub-word unit from a word

•  Evaluate  the  recognizer  using  the  pronunciation  dictionary  and  sub-word  LM 

•  Attach  sub-word  units  that  start  with  a  +  sign  to  the  previous  word  or  sub-word  unit 
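
The marking and rejoining steps in the procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts used in this work: the function names `mark_subwords` and `join_subwords` are hypothetical, and the segmentation is assumed to be already available as a list of sub-word units per word (for example, parsed from Morfessor output).

```python
def mark_subwords(segmented_words):
    """Flatten segmented words into LM tokens.

    Every sub-word unit except the first unit of a word gets a '+' prefix.
    `segmented_words` is a list of words, each given as a list of units.
    """
    tokens = []
    for units in segmented_words:
        for position, unit in enumerate(units):
            tokens.append(unit if position == 0 else "+" + unit)
    return tokens


def join_subwords(tokens):
    """Rebuild words from recognizer output.

    Each token that starts with '+' is attached to the previous token.
    A '+' token with nothing before it is kept as a word of its own.
    """
    words = []
    for token in tokens:
        if token.startswith("+") and words:
            words[-1] += token[1:]
        else:
            words.append(token.lstrip("+"))
    return words


# Illustrative English segmentation (not taken from the Korean data).
tokens = mark_subwords([["un", "break", "able"], ["code"]])
words = join_subwords(tokens)
```

Because only non-initial units carry the + sign, word boundaries can be recovered from the recognizer output without any additional markers.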

This  procedure  was  applied  to  text  from  GlobalPhone  [2],  the  Korean  Broadcast  News  corpus 
[3],  the  Korean  Newswire  corpus  [4],  and  articles  downloaded  from  Wikipedia.1  Interpolated 
trigram  LMs  were  estimated  using  the  Stanford  Research  Institute  LM  (SRILM)  toolkit  [5]. 
Unless stated otherwise, all N-gram LMs discussed in this document were estimated using
modified Kneser-Ney smoothing. The vocabulary for each LM included 500000 tokens and was chosen
using the select-vocab program from the SRILM toolkit.

1Available at: http://dumps.wikimedia.org/kowiki

Acoustic  Models  (AMs)  were  trained  on  GlobalPhone  and  the  Korean  Broadcast  News  corpus 
using HTK [6]. Pronunciations for all words were derived using letter-to-sound rules [7].
Phonemes were modeled using state-clustered across-word triphone HMMs, and the final HMM
set included 3000 shared states with an average of 16 mixtures per state. The models were
discriminatively trained using the Minimum Phone Error (MPE) criterion. The feature set
consisted of 12 Perceptual Linear Prediction (PLP) coefficients, plus the zeroth coefficient, with
mean normalization applied on a per-utterance basis. Delta, acceleration, and third differential
coefficients were appended to form a 52-dimensional feature vector, and Heteroscedastic Linear
Discriminant Analysis (HLDA) was applied to reduce the feature dimension to 39. A second set
of models was estimated that included Speaker Adaptive Training (SAT).

Each  set  of  models  was  evaluated  on  the  GlobalPhone  and  Korean  Broadcast  News  development 
partition.  Initial  transcripts  were  produced  using  the  HTK  large  vocabulary  continuous  speech 
recognizer  HDecode.  Constrained  Maximum  Likelihood  Linear  Regression  (CMLLR) 
transforms  were  estimated  for  each  speaker,  and  the  final  recognition  hypotheses  were  generated 
using  the  SAT  HMMs.  Table  1  shows  the  Character  Error  Rate  (CER)  and  Word  Error  Rate 
(WER)  obtained  with  each  system.  The  sub-word  units  yielded  an  improvement  in  CER  and 
WER  on  both  partitions. 

Table 1: Korean CER and WER on the GlobalPhone and Korean Broadcast News Development Partitions

                                   GlobalPhone       Broadcast News
Morfessor Algorithm                CER     WER       CER     WER
None                               12.1    51.6      14.2    39.0
Baseline                           10.9    42.9      13.4    38.5
Categories-MAP Perplexity 10       11.1    43.5      13.4    37.9
Categories-MAP Perplexity 50       11.3    45.6      13.5    38.5
Categories-MAP Perplexity 100      11.3    46.1      13.5    38.6
Categories-MAP Perplexity 400      11.3    47.2      13.5    38.7

2.1.2.  Conversational Telephone ASR

This section describes the Levantine Arabic and Farsi ASR systems that were developed on
conversational telephone speech. This is a particularly difficult task because conversational speech is
highly coarticulated, less predictable than other types of speech (e.g., read speech, lectures, or
broadcast  news),  and  typically  includes  more  sentence  restarts,  word  fragments,  and  filled 
pauses.  In  addition,  ASR  systems  perform  worse  on  telephone  speech  due  to  channel  variability, 
reduced  bandwidth,  and  transmission  artifacts. 

Levantine Arabic: An ASR system was developed on 31 hours of speech from the Levantine
Arabic Conversational Telephone Speech Corpus (ARB-CTS) [8]. Prior to training the AMs,
long periods of silence were removed from the audio files using an amplitude-based Speech
Activity Detector (SAD). Gain normalization was applied to each utterance so that the maximum
sample value was 32767, and 100 millisecond frames were extracted every 50 milliseconds.
Frames were classified as speech if the maximum sample value was greater than 2000, and
silence otherwise. All speech end points were padded by 200 milliseconds, and the silence
regions were removed from each utterance. In a preliminary experiment, this process yielded a
7.5% absolute improvement in WER.
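
The amplitude-based SAD described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the software used in this work: the function names are hypothetical, an 8 kHz sampling rate is assumed for telephone speech, and "maximum sample value" is interpreted as the maximum absolute value of the 16-bit samples. The frame length, frame shift, threshold, and padding are the values given in the text.

```python
def normalize_gain(samples, peak=32767):
    """Scale the samples so that the largest absolute value equals `peak`."""
    largest = max((abs(s) for s in samples), default=0)
    if largest == 0:
        return list(samples)
    return [int(round(s * peak / largest)) for s in samples]


def speech_regions(samples, rate=8000, frame_ms=100, shift_ms=50,
                   threshold=2000, pad_ms=200):
    """Return merged (start, end) sample ranges that are classified as speech.

    A frame is speech if its maximum absolute sample exceeds `threshold`.
    Each speech frame is padded by `pad_ms` on both sides, and overlapping
    ranges are merged.
    """
    frame = rate * frame_ms // 1000
    shift = rate * shift_ms // 1000
    pad = rate * pad_ms // 1000
    regions = []
    for start in range(0, max(len(samples) - frame, 0) + 1, shift):
        window = samples[start:start + frame]
        if window and max(abs(s) for s in window) > threshold:
            lo = max(start - pad, 0)
            hi = min(start + frame + pad, len(samples))
            if regions and lo <= regions[-1][1]:
                regions[-1] = (regions[-1][0], hi)
            else:
                regions.append((lo, hi))
    return regions


def remove_silence(samples, **kwargs):
    """Normalize the gain, then keep only the padded speech regions."""
    normalized = normalize_gain(samples)
    kept = []
    for lo, hi in speech_regions(normalized, **kwargs):
        kept.extend(normalized[lo:hi])
    return kept


# One second of silence, half a second of signal, one second of silence.
example = [0] * 8000 + [10000] * 4000 + [0] * 8000
trimmed = remove_silence(example)
```

Applying the gain normalization first makes the fixed threshold of 2000 relative to the loudest sample in each utterance rather than to the recording level.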

AMs were trained using the same procedure described in Section 2.1.1, except that feature mean
and variance normalization were applied on a conversation side basis. The final HMM set
included  3000  shared  states  with  an  average  of  24  mixtures  per  state.  The  HMM  system  was 
evaluated  using  a  trigram  LM  that  was  estimated  on  the  training  transcripts.  This  system  yielded 
a  60.0%  WER  on  the  ARB-CTS  test  partition. 

A  second  ASR  system  was  developed  on  138  hours  of  speech  from  the  Levantine  Arabic  QT 
training  data  set  5  (ARB-QT)  [9].  This  system  was  trained  using  the  same  procedure  described 
above,  except  that  the  amplitude  SAD  was  not  applied  because  the  utterances  did  not  include 
large  regions  of  silence.  The  final  HMM  set  included  5000  shared  states  with  an  average  of  28 
mixtures  per  state.  Decoding  was  performed  using  a  trigram  LM  that  was  estimated  on  the 
training transcripts. This system yielded a 51.0% WER on the ARB-QT test partition.

Farsi: Speech recognition systems were developed on eight hours of speech from the Appen
mobile network mini database (ASR001) and 20 hours of speech from the Appen
conversational telephone speech corpus (ASR002).2 Long periods of silence were removed from
the training files using the amplitude-based SAD described above. In a preliminary experiment,
this yielded a 1.2% absolute improvement in WER.

An  initial  set  of  AMs  was  trained  using  the  same  procedure  as  the  Levantine  Arabic  systems.  The 
HMM  system  included  2000  shared  states  with  an  average  of  20  mixtures  per  state.  Phoneme 
alignments  were  generated  for  the  entire  training  partition,  and  any  utterance  that  included  a 
phoneme  duration  greater  than  one  second  was  sequestered  from  the  training  set.  AMs  were 
retrained  on  the  modified  training  set  using  the  same  procedure  described  above.  In  a  preliminary 
experiment,  sequestering  training  utterances  with  long  phoneme  durations  yielded  a  0.7% 
absolute  improvement  in  WER. 
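
The duration-based filtering step can be illustrated with a short sketch. The function name and the alignment data layout are hypothetical, and the sketch treats every aligned label alike; whether non-speech labels were exempt from the one-second limit is not stated in this report.

```python
def sequester_long_phonemes(alignments, max_duration=1.0):
    """Split utterances into kept and sequestered sets.

    `alignments` maps an utterance id to a list of
    (phoneme, start_seconds, end_seconds) tuples from forced alignment.
    An utterance is sequestered if any phoneme lasts longer than
    `max_duration` seconds.
    """
    kept, sequestered = [], []
    for utt_id, segments in alignments.items():
        longest = max((end - start for _, start, end in segments), default=0.0)
        if longest > max_duration:
            sequestered.append(utt_id)
        else:
            kept.append(utt_id)
    return kept, sequestered


# Illustrative alignments with made-up utterance ids and phoneme labels.
example_alignments = {
    "utt1": [("s", 0.00, 0.12), ("a", 0.12, 0.30)],
    "utt2": [("m", 0.00, 0.10), ("a", 0.10, 1.45)],
}
kept_ids, sequestered_ids = sequester_long_phonemes(example_alignments)
```

Very long phoneme durations usually indicate that the transcript and the audio do not match, so removing these utterances acts as an inexpensive check on transcript quality.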

LMs  were  estimated  on  ASR002;  the  Translation  System  for  Tactical  Use  (TRANSTAC)  corpus; 
the  Uppsala  Persian  corpus  [10];  the  Tehran  English-Persian  corpus  [11];  translated  text  from 
Technology,  Entertainment,  And  Design  (TED)  conferences;3  and  articles  downloaded  from 
Wikipedia.4 Note that only the ASR002 text includes diacritics. One trigram LM was estimated
on  the  ASR002  text  that  included  diacritics,  and  a  second  LM  was  estimated  on  the  same  set  of 
text  with  all  diacritics  removed.  An  interpolated  trigram  LM  was  trained  on  all  of  the  text 
without  diacritics. 

Each  system  was  evaluated  on  the  ASR002  development  partition,  and  all  diacritics  were 
removed  prior  to  calculating  the  WER.  The  ASR002  LM  with  diacritics  yielded  a  60.6%  WER, 
and  the  ASR002  LM  without  diacritics  yielded  a  62.6%  WER.  The  interpolated  trigram  LM 
trained  on  all  sources  yielded  a  62.3%  WER. 


2Appen  corpora  are  available  at:  http://www.appen.com 

3Available at: http://www.ted.com

4Available  at:  http://dumps.wikimedia.org/fawiki 

Distribution  A:  Approved  for  Public  Release;  Distribution  unlimited;  88ABW-2015-2172;  Cleared  29  April  2015 


2.1.3.  Pashto  System  Combination 

Three different methods were investigated for combining Pashto speech recognition systems:
Recognizer  Output  Voting  Error  Reduction  (ROVER)  [12],  N-best  ROVER,  and  word 
posterior  decoding  using  matching  scores  from  the  Driven  Decoding  Algorithm  (DDA)  [13]. 
ROVER  aligns  the  1-best  hypotheses  from  multiple  ASR  systems  and  applies  a  voting  scheme 
to  select  the  best  transcript.  The  1-best  hypotheses  were  obtained  using  word  posterior 
probability  decoding  [14],  and  ROVER  was  applied  using  the  SRover  program  from  the  Brno 
toolkit.5  N-best  ROVER  creates  a  confusion  network  using  the  N-best  lists  from  multiple  ASR 
systems  and  selects  the  word  with  the  highest  posterior  probability  from  each  correspondence 
set.  This  was  accomplished  using  the  nbest-rover  program  from  the  SRILM  toolkit. 

The third method computes matching scores by aligning N-best hypotheses from a primary
ASR system to an auxiliary transcript produced by one or more secondary ASR systems. Each
N-best hypothesis from the primary ASR system is aligned to the auxiliary transcript using a
Dynamic Programming (DP) algorithm. The DP alignment was implemented using the same
method as the sclite program from the National Institute of Standards and Technology (NIST)
speech recognition scoring toolkit.6 Consider a single hypothesis from the primary system
$W = (w_1, w_2, \ldots, w_L)$ that is aligned to the auxiliary transcript
$W' = (w'_1, w'_2, \ldots, w'_L)$. A matching score $\theta(w_i)$ was assigned to each word
based on the number of words in the history that match the auxiliary transcript

$$
\theta(w_i) =
\begin{cases}
0.99 & \text{if } w_j = w'_j \text{ for } i-3 \le j \le i \text{ and } j > 0 \\
0.9  & \text{if } w_j = w'_j \text{ for } i-2 \le j \le i \text{ and } j > 0 \\
0.4  & \text{if } w_j = w'_j \text{ for } i-1 \le j \le i \text{ and } j > 0 \\
0.1  & \text{if } w_i = w'_i \\
0.01 & \text{if } w_i \ne w'_i
\end{cases}
\tag{1}
$$

The  final  score  for  each  word  was  a  weighted  combination  of  the  matching  score,  the  AM  score, 
the  LM  score,  and  the  word  insertion  penalty.  The  1-best  transcript  was  selected  from  the  N- 
best  list  using  posterior  probability  decoding. 
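
The matching score can be sketched as follows. The score grows with the number of consecutive words, ending at the current word, that agree with the auxiliary transcript: 0.99 for the current word plus three matching history words, 0.9 for two, 0.4 for one, 0.1 for a match on the current word only, and 0.01 for a mismatch. The function name and data layout are hypothetical. The sketch assumes that the hypothesis has already been aligned to the auxiliary transcript, with `None` marking a hypothesis word that has no aligned auxiliary word, and that the history window is simply truncated at the start of the hypothesis (the `j > 0` condition), with the conditions tested from the highest score downward.

```python
def matching_scores(hypothesis, auxiliary):
    """Return one matching score per hypothesis word.

    `hypothesis` and `auxiliary` are aligned lists of equal length.
    `auxiliary[i]` is the auxiliary word aligned to `hypothesis[i]`,
    or None if the alignment left that position empty.
    """
    if len(hypothesis) != len(auxiliary):
        raise ValueError("inputs must be aligned to the same length")
    scores = []
    run = 0  # consecutive matches ending at the current position
    for i, (word, aux_word) in enumerate(zip(hypothesis, auxiliary), start=1):
        matched = aux_word is not None and word == aux_word
        run = run + 1 if matched else 0
        if not matched:
            scores.append(0.01)
        elif run >= min(4, i):   # current word and up to three history words
            scores.append(0.99)
        elif run >= min(3, i):   # current word and two history words
            scores.append(0.9)
        elif run >= min(2, i):   # current word and one history word
            scores.append(0.4)
        else:                    # only the current word matches
            scores.append(0.1)
    return scores


example_scores = matching_scores(
    ["a", "b", "c", "d", "e", "f"],
    ["a", "b", "x", "d", "e", "f"],
)
```

Under this reading, a mismatch resets the history, so the words that follow it must match again before the score returns to its highest value.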

Each  method  was  evaluated  using  three  Pashto  ASR  systems:  one  hybrid  DNN-HMM  system 
and  two  HMM  systems.  The  weights  for  each  system  were  tuned  on  the  TRANSTAC 
development  partition  using  the  nbest-optimize  program  from  the  SRILM  toolkit.  The  hybrid 
DNN-HMM  system  was  used  as  the  primary  system  when  calculating  matching  scores,  and  the 
auxiliary transcript was obtained by combining the two HMM systems using N-best ROVER.
Table  2  shows  the  WERs  obtained  on  the  TRANSTAC  test  partition.  N-best  ROVER  yielded 
the  best  performance. 


5Available at: http://speech.fit.vutbr.cz/software/hmm-toolkit-stk

6Available at: http://www.itl.nist.gov/iad/mid/tools

Table 2: Pashto WER on the TRANSTAC Test Partition
Three ASR systems were evaluated and system combination was performed using ROVER, N-best ROVER, and
word posterior decoding with DDA matching scores.

System Combination       WER
None                     34.4/33.4/32.9
ROVER                    31.9
N-best ROVER             31.4
DDA matching scores      31.9

2.1.4.  Hybrid  DNN-HMM  Systems 

This  section  describes  the  software  that  was  developed  for  training  and  evaluating  hybrid  DNN- 
HMM  speech  recognition  systems.  Whereas  standard  HMM  systems  model  observation 
probabilities  using  Gaussian  Mixture  Models  (GMMs),  hybrid  DNN-HMM  systems  replace  the 
GMMs in a well-trained HMM system with a DNN. In the context of this paper, DNNs are
feed-forward neural networks with more than one hidden layer. The procedure for developing a hybrid
DNN-HMM  system  can  be  summarized  as  follows: 

•  Train  a  state-clustered  GMM-HMM  system 

•  Generate  HMM  state-level  time  alignments  of  the  training  data  using  forced  alignment 

•  Train  a  DNN  to  model  the  shared  states  of  the  GMM-HMM  system 

•  Use  the  DNN  instead  of  the  GMMs  when  evaluating  the  recognizer 

The  GMM-HMM  system  and  state-level  time  alignments  can  be  generated  using  HTK.  DNNs 
were  trained  using  layer  growing  back  propagation  [15].  This  method  estimates  the  parameters 
for  a  DNN  by  first  initializing  a  one  hidden  layer  network  with  random  weights  and  training  the 
network  to  convergence  using  error  back  propagation.  Next,  the  output  layer  is  replaced  with  a 
second  randomly  initialized  hidden  layer,  followed  by  a  randomly  initialized  output  layer.  This 
network  is  then  trained  to  convergence,  and  the  process  of  replacing  the  output  layer  and 
retraining  the  network  is  repeated  until  the  DNN  includes  the  desired  number  of  hidden  layers. 

Two different programs were investigated for training DNNs: the International Computer
Science Institute (ICSI) QuickNet software package7 and Theano [16]. To train DNNs with
QuickNet, software was developed to convert HTK state-level time alignments to QuickNet pfile
format and to replace the output layer in QuickNet Matlab Level-4 network files with a randomly
initialized hidden layer and output layer. One limitation of QuickNet is that it only supports a
maximum of three hidden layers; software was developed using Theano to train deeper networks.
Python code was written to read input vectors into a cache, apply a context window, remove
unwanted samples, randomize the data, and copy the data to the Graphics Processing Unit
(GPU). The DNN training algorithm and evaluation routines were implemented in Theano by
modifying the multilayer perceptron code from [17].
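
As an illustration of the context window step, the sketch below stacks each frame with its neighbors to form the DNN input vector. It is not the code written for this work: the function name is hypothetical, and repeating the first and last frames at the utterance boundaries is an assumed padding choice that the report does not specify. The default width of 9 frames matches the window used for the DNNs in this section.

```python
def apply_context_window(frames, width=9):
    """Stack each frame with its neighbors into one input vector.

    `frames` is a list of equal-length feature vectors (lists of floats).
    The window is centered on the current frame, so `width` must be odd.
    Frames beyond the utterance boundaries are replaced by the first or
    last frame.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        stacked = []
        for offset in range(-half, half + 1):
            index = min(max(t + offset, 0), last)
            stacked.extend(frames[index])
        spliced.append(stacked)
    return spliced


# Three one-dimensional frames with a three-frame window.
example_inputs = apply_context_window([[1.0], [2.0], [3.0]], width=3)
```

With 39-dimensional features and a 9-frame window, each input vector produced this way has 351 values.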

7Available at: http://www1.icsi.berkeley.edu/Speech/icsi-speech-tools.html

Table 3: English WER on the IWSLT dev2010 Partition using Hybrid DNN-HMM Systems
QuickNet was used to train DNNs with 1-3 hidden layers, and Theano was used to train DNNs with 1-5 hidden
layers.

                 Hidden Layers
DNN software     1      2      3      4      5
QuickNet         24.    21.    20.    -      -
Theano           24.    21.    20.    20.    19.

5 7 7 1 8 8

Lastly, HDecode and the Sphinx-4 speech recognizer were modified to read HMM state
likelihoods from HTK feature files [18]. For a given state $s$ and observation vector $o$, the
posteriors from the DNN were converted to likelihoods by dividing by the prior probability of
each state, i.e.,

$$
P(o \mid s) = \frac{P(s \mid o)\, P(o)}{P(s)} \tag{2}
$$

where $P(s \mid o)$ is the posterior probability estimated by the DNN, $P(s)$ is the prior
probability of $s$ estimated from the training data, and $P(o)$ is a constant that can be ignored.
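
The conversion from DNN posteriors to likelihoods can be sketched as follows. The sketch works in the log domain, which is a common choice for decoders but is an assumption here, as are the function name and the small floor that guards against taking the logarithm of zero. The constant term for the observation probability is dropped, as described above.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert state posteriors P(s|o) to scaled log likelihoods.

    Returns log P(s|o) - log P(s) for every state. The term log P(o) is
    the same for all states of a frame and is therefore omitted.
    """
    if len(posteriors) != len(priors):
        raise ValueError("one prior is required per state")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# Three states: the first is rare in training but likely for this frame.
example_loglikes = scaled_log_likelihoods(
    posteriors=[0.5, 0.25, 0.25],
    priors=[0.25, 0.25, 0.5],
)
```

Dividing by the prior raises the score of states that are rare in the training alignments and lowers the score of frequent ones.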

To compare QuickNet and Theano, hybrid DNN-HMM systems were developed on 58 hours
of TED talks. The GMM-HMM models were trained using the same procedure described in
Section 2.1.1, and the final HMM set included 3000 shared states with an average of 24
mixtures per state. DNNs were trained using a maximum of 5 hidden layers, each of which
had 1000 neurons with logistic activation functions. A context window of 9 frames was used
at the input, and the output included 3000 units corresponding to the shared states of the
GMM-HMM system. The feature set consisted of 13 PLPs with delta and acceleration
coefficients, and all features were normalized to zero mean and unit variance on a per-speaker
basis. Training was performed with a minibatch size of 512 and an initial learning rate of
0.008 that was halved after each epoch once the improvement in accuracy on the cross
validation partition fell below 0.5%. Training was completed once the improvement in
accuracy fell below 0.5% a second time.9
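
The learning rate schedule described above can be sketched as a function of the cross validation accuracy after each epoch. This is an illustration with a hypothetical function name, not QuickNet's implementation. It assumes that the first epoch always counts as an improvement and that the rate is halved for every epoch after the first time the improvement falls below the threshold.

```python
def learning_rate_schedule(cv_accuracies, initial_rate=0.008, threshold=0.5):
    """Return the learning rate used in each epoch that is actually run.

    `cv_accuracies` holds the cross validation accuracy (in percent)
    measured after each epoch. The rate stays constant until the
    improvement drops below `threshold`, is then halved after every epoch,
    and training stops when the improvement drops below `threshold` a
    second time.
    """
    rates = []
    rate = initial_rate
    halving = False
    previous = None
    for accuracy in cv_accuracies:
        rates.append(rate)
        improvement = None if previous is None else accuracy - previous
        previous = accuracy
        if improvement is not None and improvement < threshold:
            if halving:
                break  # second small improvement: stop training
            halving = True
        if halving:
            rate /= 2.0
    return rates


# Made-up accuracies used only to exercise the schedule.
example_rates = learning_rate_schedule([40.0, 45.0, 47.0, 47.3, 48.0, 48.2, 48.3])
```

In this example the rate is constant for four epochs, halved twice, and training stops after the sixth epoch, so the seventh accuracy value is never used.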

Each  system  was  evaluated  on  the  dev2010  partition  from  the  IWSLT  evaluation  campaign 
[19].  Decoding  was  performed  using  a  single  pass  of  HDecode  with  a  trigram  LM  that  was 
developed  for  IWSLT  2012  [20].  Table  3  shows  the  WERs  obtained  with  each  DNN.  For 
comparison  purposes,  the  GMM-HMM  system  was  evaluated  using  the  same  procedure 
described  in  Section  2.1.1;  the  first  pass  yielded  a  22.0%  WER,  and  the  second  pass  yielded  a 
19.9%  WER. 

8Available at: http://cmusphinx.sourceforge.net
9This is the QuickNet newbob training strategy.

2.1.5.  IWSLT 2013

This section describes the English ASR system that was developed for the IWSLT 2013
evaluation campaign. This task focuses on the automatic transcription of TED talks, which are
professionally recorded presentations given on a variety of topics related to technology,
entertainment, and design.

Each talk is a maximum of 18 minutes in length. The TED website10 makes the video recordings
and closed captions from over 1900 talks available for download.

AMs were trained on 807 TED talks that were recorded prior to 2011. The audio was extracted
from each video file using FFmpeg,11 and then downsampled to 16 kHz using SoX.12 Long
periods of untranscribed audio were removed from each talk using the time marks from the
closed captions, and word alignments were automatically generated using an HTK HMM system
developed on HUB4 [21, 22]. These alignments were used to split each talk into utterances that
were shorter than 20 seconds and included 0.1-0.25 seconds of non-speech at the end points.
Next,  closed  caption  filtering  [23]  was  applied  to  the  TED  data  to  sequester  utterances  that  may 
include  transcription  errors.  Each  talk  was  decoded  using  the  HUB4  HMMs  and  a  trigram  LM 
that  was  estimated  on  the  transcripts  for  the  talk.  The  recognizer  outputs  were  compared  to  the 
transcripts,  and  a  data  partition  was  created  using  all  utterances  with  a  WER  less  than  30%.  This 
process  yielded  166  hours  of  audio. 
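
The closed caption filtering step can be illustrated with a short sketch that scores each utterance against its caption and keeps those below the WER threshold. The function names and the data layout are hypothetical, and the WER is computed here with a plain word-level edit distance, which is the standard definition but not necessarily the exact scoring tool used in this work.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions, and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * word_errors(reference, hypothesis) / len(reference)


def filter_utterances(utterances, max_wer=30.0):
    """Keep the ids of utterances whose WER is below `max_wer`.

    `utterances` is an iterable of (id, caption_words, recognizer_words).
    Utterances with an empty caption are dropped.
    """
    return [
        utt_id
        for utt_id, caption, recognized in utterances
        if caption and wer(caption, recognized) < max_wer
    ]


# Made-up utterances: an exact match, one substitution, two substitutions.
example_kept = filter_utterances([
    ("u1", "a b c d".split(), "a b c d".split()),
    ("u2", "a b c d".split(), "a x c d".split()),
    ("u3", "a b c d".split(), "a x y d".split()),
])
```

A high WER against a language model trained on the talk's own transcript suggests that the caption and the audio disagree, which is why such utterances are set aside.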

A  speaker  independent  hybrid  DNN-HMM  speech  recognition  system  was  developed  using  the 
Theano  software  described  in  Section  2.1.4.  The  GMM-HMM  system  included  6000  shared 
states  with  an  average  of  28  mixtures  per  state;  the  DNN  included  a  context  window  of  9  frames 
on  the  input,  5  hidden  layers  with  1000  units  each,  and  6000  output  units.  A  speaker  adaptive 
DNN  was  trained  on  PLP  features  that  were  transformed  using  CMLLR.  This  system  applied  a 
single  transform  per  speaker. 

LMs  were  developed  on  the  TED  data  provided  by  IWSLT,  the  English  Gigaword  corpus  [24], 
and  the  News  2007-2012  texts  from  the  Association  for  Computational  Linguistics  Workshop  on 
Machine Translation (WMT).14 Cross entropy difference scoring [25] was used to select subsets
of  Gigaword  and  News  2007-2012  that  matched  the  TED  domain.  Interpolated  trigram  and  4- 
gram  LMs  were  estimated  on  TED,  1/8  of  Gigaword,  and  1/4  of  News  2007-2012.  RNN 
maximum  entropy  LMs  were  developed  using  the  RNNLM  toolkit  [26].  One  RNN  was  trained 
on  1/16  of  Gigaword,  and  a  second  RNN  was  trained  on  1/8  of  News  2007-2012.  Each  network 
included  160  hidden  units,  300  classes  in  the  output  layer,  4-gram  features  for  the  direct 
connections, and a hash size of 10^9. The LM vocabulary included 95000 words and was chosen
using  the  select-vocab  program  from  the  SRILM  toolkit. 

Whereas  in  previous  IWSLT  evaluations  [19,  27]  the  test  data  was  manually  segmented  into 
spoken  utterances,  this  year  each  talk  was  provided  without  timing  information.  A  neural 
network-based  SAD  was  developed  using  Theano  to  segment  each  talk  into  utterances  and 
remove  long  periods  of  non-speech.  The  SAD  was  trained  on  22  hours  of  TED  data  and  5  hours 
of public domain music downloaded from Wikimedia Commons,15 the United States Air Force
band,16 and the Open Goldberg Variations project.17 The network included a context window of
21 frames on the input, 1 hidden layer of 500 neurons with logistic activation functions, and 3
output units corresponding to speech, silence/noise, and music. The feature set consisted of 12
PLP coefficients, plus the zeroth coefficient, with delta and acceleration coefficients. All features
were globally normalized to zero mean and unit variance. Six epochs of training were performed
with a minibatch size of 512 and an initial learning rate of 0.008 that was halved after the second
epoch.

10http://www.ted.com
11Available at: http://www.ffmpeg.org
12Available at: http://sox.sourceforge.net
13Available at: http://workshop2013.iwslt.org
14Available at: http://www.statmt.org/wmt13/translation-task.html
15Available at: http://commons.wikimedia.org
16Available at: http://www.usafband.af.mil
17Available at: http://www.opengoldbergvariations.org

Table 4: English WER on the IWSLT 2012 Development Partitions using Manual
and Automatic Segmentations of the Data

                  Manual                          Automatic
System            dev2010   tst2010   dev2012     dev2010   tst2010   dev2012
Decode-1          14.3      13.0      15.3        15.6      14.3      16.9
Decode-2          13.7      12.3      14.0        14.8      13.5      15.8
4-gram            13.1      11.6      13.2        13.9      12.7      14.9
4-gram + RNN      12.1      10.3      11.6        12.8      11.8      13.8

Automatic  segmentation  of  the  test  data  was  performed  by  evaluating  the  SAD,  applying  a  DP 
algorithm to choose the best sequence of states, and padding the speech end points by 0.15
seconds.  The  speech  segments  from  each  talk  were  clustered  using  the  Massachusetts  Institute  of 
Technology  Lincoln  Laboratory  (MIT-LL)  GMM  software  package  [28].  Initial  transcripts  of  the 
test  data  were  produced  using  HDecode  with  the  interpolated  trigram  LM.  These  transcripts  were 
used  to  estimate  CMLLR  transforms  for  the  speaker  adaptive  hybrid  DNN-HMM  system.  A 
second  pass  of  HDecode  was  evaluated  to  generate  recognition  lattices,  which  were  then 
rescored  with  the  interpolated  4-gram  LM.  Next,  1000-best  lists  were  extracted  from  each  lattice 
and  rescored  with  the  RNN  LMs.  The  final  LM  scores  were  obtained  by  linearly  interpolating  the 
probabilities from the 4-gram and RNN LMs. Lastly, the maximum scoring hypothesis was
selected for each utterance.

Table  4  shows  the  WERs  obtained  on  the  IWSLT  development  partitions  at  each  decoding  stage. 
For  comparison  purposes,  results  are  shown  on  both  the  manually  produced  and  automatically 
derived  segmentations  of  the  data.  This  system  yielded  a  15.9%  WER  on  the  tst2013  partition 
and  placed  third  out  of  the  eight  ASR  systems  that  were  submitted  for  the  evaluation. 

2.1.6.  LM Interpolation

Six different methods were investigated for interpolating probabilities from 4-gram and RNN
LMs. Note that the LMs described in this paper estimate the probability $P(w \mid h)$ for a word $w$
with history $h$, where $0 < P(w \mid h) < 1$. One of the most popular methods for combining
probabilities from multiple models is linear interpolation. Given the probabilities $P_k(w \mid h)$ from $N$
models, the interpolated probability can be calculated as

$$
P(w \mid h) = \sum_{k=1}^{N} \lambda_k P_k(w \mid h), \tag{3}
$$

where $\lambda_k$ is the interpolation weight for the $k$th model. The interpolation weights are typically
subject to the constraints

$$
0 < \lambda_k < 1 \quad \text{and} \quad \sum_{k=1}^{N} \lambda_k = 1.
$$

A modified version of linear interpolation was implemented where the range of probabilities
$P_k(w|h)$ from the individual models was restricted. For a given word $w$ and history $h$, the minimum
probability from any model was set to the maximum probability divided by an empirically chosen
integer $L$, that is

$$P_{\min} = \Big( \max_{1 \le k \le N} P_k(w|h) \Big) / L,$$
$$P_k(w|h) = P_{\min} \quad \text{if } P_k(w|h) < P_{\min}. \qquad (4)$$

Recall that Equation 3 computes a weighted sum. Alternatively, $P(w|h)$ was calculated by
selecting the minimum, maximum, or median of $P_k(w|h)$ for $1 \le k \le N$. Lastly, LM
interpolation was performed by linearly interpolating the log probabilities from multiple models

$$P(w|h) = \exp \Big( \sum_{k=1}^{N} \lambda_k \log P_k(w|h) \Big). \qquad (5)$$

Each method of LM interpolation was evaluated on the IWSLT 2013 development partitions. The
probabilities $P_k(w|h)$ were obtained from the ASR system described in Section 2.1.5. This system
provided  1000-best  lists  that  were  scored  with  three  different  models:  one  4-gram  LM  and  two 
RNN  LMs.  These  three  LMs  are  referred  to  as  forward  models  in  the  remainder  of  this  section.  A 
second  set  of  backward  RNN  LMs  were  developed  on  Gigaword  and  News  2007-2012.  These 
models  were  trained  using  the  same  procedure  described  in  Section  2.1.5,  except  that  the  word 
order  of  the  input  text  was  reversed  during  training  and  evaluation.  The  backward  RNN  LMs  were 
used  to  rescore  the  same  set  of  1000-best  lists. 

The linear interpolation method described by Equation 4 was evaluated using L = 5, 10, 20, and 100,
and the interpolation weights $\lambda_k$ were optimized using the compute-best-mix program from the
SRILM  toolkit.  Each  interpolation  method  was  evaluated  using  two  different  sets  of  LMs:  the  first 
set  included  the  three  forward  LMs,  and  the  second  set  included  the  three  forward  LMs  and  two 
backward  LMs.  Table  5  shows  the  WERs  obtained.  Linearly  interpolating  the  log  probabilities 
from  each  model  yielded  the  best  results,  especially  when  including  the  backward  LMs. 
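
For reference, the six interpolation methods can be written compactly for a single word and history. The sketch below is illustrative only: the function names and the example probabilities and weights are assumptions, and the weights are fixed by hand rather than optimized with compute-best-mix. Note that the value returned by the log-linear form (Equation 5) is not renormalized over the vocabulary, so it is best read as a score for ranking N-best hypotheses.

```python
import math
from statistics import median


def linear(probs, weights):
    """Equation 3: weighted sum of the model probabilities."""
    return sum(w * p for w, p in zip(weights, probs))


def linear_floored(probs, weights, L):
    """Equation 4: raise any probability below max(probs) / L to that floor."""
    p_min = max(probs) / L
    return linear([max(p, p_min) for p in probs], weights)


def log_linear(probs, weights):
    """Equation 5: exponentiated weighted sum of log probabilities."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, probs)))


if __name__ == "__main__":
    # Hypothetical probabilities from one 4-gram LM and two RNN LMs.
    probs = [0.02, 0.0001, 0.01]
    weights = [0.5, 0.25, 0.25]
    print("linear     ", linear(probs, weights))
    print("linear L=10", linear_floored(probs, weights, L=10))
    print("minimum    ", min(probs))
    print("maximum    ", max(probs))
    print("median     ", median(probs))
    print("log linear ", log_linear(probs, weights))
```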

2.1.6.  IWSLT  2014 

English  and  Italian  ASR  systems  were  developed  for  the  IWSLT  2014  evaluation  campaign.  This 
task  focuses  on  the  automatic  transcription  of  English  TED  talks  and  Italian  TEDx  talks.  TEDx 
talks  are  similar  to  TED,  but  given  on  a  wider  array  of  topics  at  independently  organized  events 
across the world. Whereas TED talks typically include high quality speech, TEDx talks are
recorded with varying degrees of quality and may include reverberated speech, background noise,
or audio compression artifacts.

Table 5: English WER on the IWSLT Development Partitions using Six Different Methods
for Interpolating Probabilities from 4-gram and RNN LMs

Each method was evaluated using (1) the three forward LMs and (2) the three forward LMs and two backward LMs.

                        Forward LMs                  Forward and Backward LMs
Interpolation      dev2010  tst2010  dev2012      dev2010  tst2010  dev2012
Linear             12.1     10.3     11.6         13.6     12.4     14.0
Linear L=5         12.2     10.4     11.7         13.9     12.8     14.5
Linear L=10        12.1     10.3     11.6         13.8     12.5     14.2
Linear L=20        12.1     10.3     11.6         13.6     12.4     14.1
Linear L=100       12.1     10.3     11.6         13.6     12.4     14.0
Linear maximum     12.4     10.9     12.2         14.7     13.9     15.5
Linear minimum     12.4     10.7     12.2         13.9     12.4     13.9
Linear median      12.3     10.4     11.9         12.2     11.1     12.0
Log linear         11.8     10.0     11.6         11.7     9.9      11.4

English:  In  addition  to  the  TED  acoustic  data  described  in  Section  2.1.5,  AMs  were  trained  on 
broadcast  news  speech  from  the  HUB4  and  Euronews  [29]  corpora.  The  audio  from  each  corpus 
was  segmented  into  utterances  using  the  manually  produced  transcripts  for  HUB4  and  the 
provided  ASR  transcripts  for  Euronews.  All  utterances  were  processed  with  a  GMM-based 
bandwidth  detector  to  identify  and  remove  telephone  bandwidth  speech.  The  MIT-LL  GMM 
software  package  was  used  to  automatically  cluster  utterances  from  the  Euronews  corpus.  This 
process  yielded  128  hours  of  audio  from  HUB4  and  96  hours  from  Euronews. 

An HMM system was trained on TED using the same procedure described in Section 2.1.1,
except that feature mean and variance normalization was applied on a per-speaker basis. The final
HMM  set  included  6000  shared  states  with  an  average  of  28  mixtures  per  state.  Hybrid  DNN- 
HMM  speech  recognition  systems  were  developed  on  TED,  HUB4,  and  Euronews  using  the 
same  procedure  described  in  Section  2.1.5.  The  GMM-HMM  set  included  8000  shared  states 
with  an  average  of  28  mixtures  per  state;  each  DNN  included  a  context  window  of  9  frames  in 
the  input,  7  hidden  layers  with  1000  units  each,  and  8000  output  units. 

LMs were developed on the TED data provided by IWSLT, the English Gigaword corpus, and
the News 2007-2013 texts from WMT.19 Data selection was implemented using the same
procedure described in Section 2.1.5. Interpolated trigram and 4-gram LMs were estimated on
TED, 1/8 of Gigaword, and 1/8 of News 2007-2013. An RNN maximum entropy LM was
trained on the same set of training texts using the RNNLM toolkit. The network included 160
hidden units, 300 classes in the output layer, 4-gram features for the direct connections, and a
hash size of 10^9. The LM vocabulary included 100000 words and was chosen using the
select-vocab program from the SRILM toolkit.

18 Available at: http://workshop2014.iwslt.org

19 Available at: http://www.statmt.org/wmt14/translation-task.html

Table 6: English WER on the IWSLT 2014 Development Partitions

Results are given for the HMM/DNN-HMM systems when both systems were evaluated in parallel.

System           dev2010      tst2010      dev2012
Decode-1         14.8         13.4         16.2
Decode-2         14.6/14.3    12.7/12.8    15.3/14.8
4-gram           14.0/13.7    12.3/12.1    14.6/14.2
4-gram + RNN     13.0/12.6    11.5/11.6    13.7/13.3
N-best ROVER     11.6         10.4         12.4


The  decoding  procedure  is  shown  in  Figure  1.  Automatic  segmentation  of  the  test  data  was 
performed  using  the  same  procedure  described  in  Section  2.1.5,  and  initial  transcripts  were 
produced  using  HDecode  with  the  speaker  independent  hybrid  DNN-HMM  system  and  the 
trigram  LM.  The  HMM  system  and  the  speaker  adaptive  hybrid  DNN-HMM  system  were  then 
evaluated  in  parallel  using  the  following  decoding  strategy.  First,  the  initial  transcripts  were  used 
to  estimate  CMLLR  feature  transforms  for  each  speaker.  Next,  recognition  lattices  were 
generated  using  Sphinx-4  with  the  HMM  system  and  HDecode  with  the  speaker  adaptive  hybrid 
DNN-HMM  system.  The  lattices  were  rescored  with  the  interpolated  4-gram  LM,  and  1000-best 
lists  were  extracted  from  each  lattice  for  rescoring  with  the  RNN  LM.  The  final  LM  scores  were 
obtained  by  linearly  interpolating  the  log  probabilities  from  the  4-gram  and  RNN  LM.  Lastly, 
system  combination  was  performed  using  the  N-best  ROVER  program  from  the  SRILM  toolkit. 

Table  6  shows  the  WERs  obtained  on  the  IWSLT  development  partitions  at  each  decoding  stage. 
The  final  submission  to  the  IWSLT  evaluation  included  an  additional  Tandem  ASR  system  that 
was  developed  by  MIT-LL  [30].  The  recognition  lattices  from  this  system  were  rescored  using 
the  same  procedure  described  above,  and  the  outputs  from  all  three  systems  were  combined  using 
N-best  ROVER.  The  final  system  yielded  a  9.9%  WER  on  the  tst2014  partition  and  placed  third 
out  of  the  eight  ASR  systems  that  were  submitted  for  the  evaluation. 

Italian: An Italian pronunciation dictionary was manually created for the 28000 most frequent
words from the Euronews corpus. This was done by a member of the Speech and Communication
Research, Engineering, Analysis, and Modeling (SCREAM) laboratory who speaks Italian as a
second language. The 51-phone set included 24 non-geminated consonants, 20 geminated
consonants, and 7 vowels. The consonants ɱ, ŋ, j, w, z were never geminated and the consonant
ɲ was always geminated. A second pronunciation dictionary with 32 phones was created by
ignoring gemination. Lastly, a multilingual pronunciation dictionary was created from the Italian
dictionary that ignored gemination and version 0.7a of the English Carnegie Mellon University
(CMU) pronunciation dictionary. Italian and English phones were merged when they shared the
same International Phonetic Alphabet (IPA) symbol. Table 7 shows the phone set for each
language; the English phones are in ARPAbet format.22 The multilingual dictionary included 48
phones.
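
The merging rule (phones are shared when their IPA symbols match) amounts to a union of two IPA-keyed inventories. The sketch below uses a small excerpt of Table 7; the function name `merge_phone_sets` and the rule that the first language's label is kept for a shared symbol are assumptions made for illustration.

```python
# Excerpt of (IPA symbol -> phone label) maps in the style of Table 7.
italian = {"p": "P", "b": "B", "ts": "TS", "dz": "DZ", "a": "A"}
english = {"p": "P", "b": "B", "h": "HH", "æ": "AE"}


def merge_phone_sets(*phone_sets):
    """Return one IPA-keyed inventory; phones sharing an IPA symbol collapse."""
    merged = {}
    for phone_set in phone_sets:
        for ipa, label in phone_set.items():
            # Keep the label from the first language that defines the symbol.
            merged.setdefault(ipa, label)
    return merged


if __name__ == "__main__":
    multilingual = merge_phone_sets(italian, english)
    print(len(multilingual), sorted(multilingual.values()))
```

Applied to the full inventories in Table 7, the 32 Italian phones and 40 English phones share 24 IPA symbols, which gives the 48 phones of the multilingual dictionary.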

20Thanks  to  Kyle  Wilkinson  for  creating  the  Italian  pronunciation  dictionary 

21Available  at:  http://www.speech.cs.cmu.edu/cgi-bin/cmudict 

22The ARPAbet to IPA mappings used in this work are available at: http://en.wikipedia.org/wiki/Arpabet

[Figure: flowchart from Audio (input) to Transcript (output)]

Figure 1: IWSLT 2014 English Decoding Procedure

Table 7: Italian and English Phone Sets

Dashes indicate that a phone does not exist in the corresponding language.

IPA  Italian  English    IPA  Italian  English    IPA  Italian  English
p    P        P          s    S        S          a    A        -
b    B        B          z    Z        Z          u    UW       UW
t    T        T          ʃ    SH       SH         o    O        -
d    D        D          ʒ    -        ZH         ɔ    AO       AO
k    K        K          h    -        HH         ɑ    -        AA
g    G        G          ts   TS       -          æ    -        AE
m    M        M          dz   DZ       -          ʌ    -        AH
ɱ    EM       -          tʃ   CH       CH         ə    -        AX
n    N        N          dʒ   JH       JH         ɝ    -        ER
ɲ    NY       -          j    Y        Y          ɪ    -        IH
ŋ    NG       NG         l    L        L          ʊ    -        UH
r    R        R          ʎ    GL       -          aʊ   -        AW
f    F        F          w    W        W          aɪ   -        AY
v    V        V          i    IY       IY         eɪ   -        EY
θ    -        TH         ɛ    EH       EH         oʊ   -        OW
ð    -        DH         e    E        -          ɔɪ   -        OY


HMM and hybrid DNN-HMM systems were trained on the Euronews Italian data set using the
same procedure as the English systems. One HMM system was trained using the 51-phone set
(denoted as HMM-51), and a second HMM system was trained using the 32-phone set (denoted
as HMM-32). HMM-51 included 6000 shared states with an average of 28 mixtures per state,
and HMM-32 included 4000 shared states with an average of 24 mixtures per state. The hybrid
DNN-HMM system was developed using HMM-51; the DNNs included 3 hidden layers with
1000 units each and 6000 output units. A final HMM system (denoted as HMM-ML) was
developed on Euronews Italian and English TED using the multilingual pronunciation
dictionary; HMM-ML included 6000 shared states with an average of 28 mixtures per state.

Interpolated trigram and 4-gram LMs were developed on the TED data provided by IWSLT,23
the Google Books Ngram corpus [31], and the Web 1T 5-gram corpus [32]. Words from the TED
data set were split on apostrophes, and N-grams from Google Books were ignored if the source
was published prior to the year 2000. The TED LMs were estimated using modified Kneser-Ney
smoothing; the Google Books and Web 1T LMs were estimated using Witten-Bell smoothing.

An RNN maximum entropy LM was trained on TED; the network included 320 hidden units,
200 classes in the output layer, 4-gram features for the direct connections, and a hash size of 10^9.
The LM vocabulary included 100000 words and was chosen using the select-vocab tool from the
SRILM toolkit.

23 Available at: http://workshop2014.iwslt.org

Automatic segmentation of the test data was initially performed using the same procedure
described  in  Section  2.1.5.  On  the  dev2014  partition,  it  was  discovered  that  the  SAD  was 
misclassifying  non-speech  segments  as  speech  on  several  TEDx  talks.  To  alleviate  this  problem, 
any  speech  segment  longer  than  20  seconds  was  reprocessed  with  a  previously  developed  neural 
network  based  SAD.  This  SAD  was  created  using  QuickNet  and  trained  on  English  telephone 
speech  from  the  Fisher  corpus  [33].  The  network  included  a  context  window  of  9  frames  on  the 
input,  1  hidden  layer  of  1400  units  with  logistic  activation  functions,  and  4  output  units 
corresponding  to  voiced  speech,  unvoiced  speech,  aspirated  speech,  and  non-speech.  The  feature 
set  consisted  of  12  PLP  coefficients,  plus  energy,  with  delta  and  acceleration  coefficients.  All 
features were globally normalized to zero mean and unit variance. As with the English system,
speech  segments  from  each  talk  were  clustered  using  the  MIT-LL  GMM  software  package. 
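
The routing rule for long segments, and the input size implied by the feature description, can be sketched as follows. The function name `split_by_duration` and the tuple representation are assumptions. The input size is an inference, not a figure from the report: it assumes 13 static features (12 PLP plus energy), each with delta and acceleration coefficients, concatenated over the 9-frame context window.

```python
MAX_SEGMENT_SECONDS = 20.0


def split_by_duration(segments, max_seconds=MAX_SEGMENT_SECONDS):
    """Separate (start, end) segments into those kept as-is and those
    longer than max_seconds, which would be sent to the second SAD."""
    keep, reprocess = [], []
    for start, end in segments:
        if end - start > max_seconds:
            reprocess.append((start, end))
        else:
            keep.append((start, end))
    return keep, reprocess


# Inferred network input size (assumption: frames are simply concatenated).
features_per_frame = (12 + 1) * 3   # PLP + energy, with delta and acceleration
input_units = features_per_frame * 9  # 9-frame context window

if __name__ == "__main__":
    print(split_by_duration([(0.0, 5.0), (10.0, 35.0), (40.0, 60.0)]))
    print(features_per_frame, input_units)
```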

Decoding  was  performed  as  follows.  Initial  transcripts  of  the  test  data  were  produced  using 
HDecode  with  the  speaker  independent  hybrid  DNN-HMM  system  and  the  trigram  LM.  The 
HMM-32,  HMM-ML,  and  speaker  adaptive  hybrid  DNN-HMM  systems  were  then  evaluated  in 
parallel  using  HDecode  with  the  same  decoding  strategy  as  the  English  system.  Finally,  system 
combination was performed using N-best ROVER.

It  was  discovered  that  there  were  a  number  of  errors  in  the  reference  transcripts  for  the  IWSLT 
dev2014  partition.  Therefore,  a  member  of  the  SCREAM  laboratory24  manually  corrected  the 
reference  transcripts  for  all  13  TEDx  talks.  Table  8  shows  the  WERs  on  the  dev2014  partition  at 
each  decoding  stage.  For  comparison  purposes,  results  are  included  without  cross  adaptation  of 
the  HMM-32  and  HMM-ML  systems;  that  is,  each  system  was  evaluated  independently  instead 
of using the initial transcripts from the speaker independent hybrid DNN-HMM system. Results
are also included with N-best ROVER applied at each decoding stage. Table 8 shows that cross
adaptation of the HMM-32 and HMM-ML systems improved the WER; for example, with the
corrected references, the HMM-32 WER after RNN rescoring dropped from 29.4% to 26.6%.

The final system yielded a 23.0% WER on the tst2014 partition and placed second out of the four
ASR  systems  that  were  submitted  for  the  evaluation. 

2.2  Haystack  MMIER  System 

This  section  describes  improvements  made  to  the  Haystack  MMIER  system.  Section  2.2.1 
discusses  improvements  made  to  the  user  interface.  Section  2.2.2  discusses  several 
improvements  that  were  made  to  the  processing  pipeline.  Section  2.2.3  describes  the  Japanese, 
Chinese,  and  Pashto  ASR  systems  that  were  developed  for  Haystack. 

2.2.1.  User  Interface  Improvements 

Haystack has recently grown from version 0.6 to version 0.8. There have been many
additions to the user interface, including multiple file upload abilities, expansive changes to the
Haystack Media Player, additional MT capabilities, and research into geolocation. To expand the
Haystack toolset, Optical Character Recognition (OCR) was tested as a new avenue for media
translation, and scripts were developed that allow complete webpages to be uploaded for
translation while keeping the format intact. Future growth of Haystack includes
cutting ties to tools that limit its functionality across the spectrum of operating systems, such
24Thanks to Kyle Wilkinson for correcting the transcripts

Table 8: Italian WER on the IWSLT dev2014 Partition

The HMM-32 and HMM-ML systems were evaluated both with and without cross adaptation. N-best ROVER was
applied at each decoding stage. WER was calculated using (a) the provided reference transcripts and (b) the corrected
reference transcripts.

(a) Provided Reference Transcripts

System                  Decode-1  Decode-2  4-gram  4-gram + RNN
No Cross Adaptation
  DNN-HMM               35.0      32.9      32.5    32.5
  HMM-32                41.2      34.4      34.1    33.9
  HMM-ML                42.7      35.9      35.7    35.4
  N-best ROVER          35.2      31.3      30.8    30.8
With Cross Adaptation
  DNN-HMM               35.0      32.9      32.5    32.5
  HMM-32                -         32.2      31.8    31.4
  HMM-ML                -         32.4      32.3    32.3
  N-best ROVER          -         30.1      29.7    29.5

(b) Corrected Reference Transcripts

System                  Decode-1  Decode-2  4-gram  4-gram + RNN
No Cross Adaptation
  DNN-HMM               30.7      27.9      27.6    27.8
  HMM-32                37.3      29.8      29.4    29.4
  HMM-ML                39.1      31.3      31.0    30.9
  N-best ROVER          31.4      26.7      26.3    26.4
With Cross Adaptation
  DNN-HMM               30.7      27.9      27.6    27.8
  HMM-32                -         27.3      27.0    26.6
  HMM-ML                -         27.5      27.5    27.5
  N-best ROVER          -         25.3      25.0    25.0

as  Adobe  Flash,  so  research  into  HTML5  was  initiated. 

Multiple File Upload: There was a need for uploading multiple files into the Haystack service. A
Flash application was developed for opening a file directory window and allowing for multiple
file  selection  for  upload.  Once  the  files  are  uploaded,  an  interface  is  created  to  display  the 
uploaded  files  in  the  queue,  and  each  file  is  automatically  processed  with  a  metadata  scan  by 
FFmpeg  for  duration,  codec,  sample  rate,  etc.  Each  file  also  has  form  fields  that  can  be  populated 
with file information, such as the source, title, and source language.

Figure 2: The Multiple File Upload Application after Processing an HFL File

For ease in uploading files across servers and from various directories, a Haystack File List
(HFL) format was developed. An HFL file is a tab-separated text file that lists the file
location, source, title, language, and keywords for multiple files; it can be uploaded to the
system  and  processed  all  at  once.  Figure  2  shows  a  screenshot  of  the  multiple  file  upload 
application  after  processing  an  HFL  file. 
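
A reader for such a file might look like the sketch below. The report does not give the exact HFL layout, so the column order, the absence of a header row, the padding of short rows, and the example paths are all assumptions.

```python
import csv
import io

# Assumed column order; the report lists these fields but not their layout.
HFL_FIELDS = ["location", "source", "title", "language", "keywords"]


def read_hfl(stream):
    """Parse a tab-separated HFL stream into a list of dicts, one per file."""
    entries = []
    for row in csv.reader(stream, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Pad short rows so every entry has all five fields.
        row = row + [""] * (len(HFL_FIELDS) - len(row))
        entries.append(dict(zip(HFL_FIELDS, row)))
    return entries


if __name__ == "__main__":
    sample = (
        "/data/a.wav\tBBC\tNews\tEnglish\tpolitics,economy\n"
        "/data/b.mp4\tRAI\tTalk\tItalian\n"
    )
    for entry in read_hfl(io.StringIO(sample)):
        print(entry)
```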

Geolocation: GeoNames is a Creative Commons geographical database with over 10 million
geographical names integrated with geographical data such as population, elevation, and
latitude/longitude coordinates. After reconfiguring the Solr schema, the GeoNames database was
indexed and experiments began on linking the geographical coordinates to named entities
identified by Janya in English and Chinese. A new search interface was also created to allow
searching for Haystack entries containing geographical locations or anything within a specific
latitude/longitude distance.

Relevancy was an issue because many place names are shared by multiple locations throughout
the world, so several methods were implemented to improve reliability. Lists of population and
popularity were created to check against results from geographical queries, and scripts were
developed to add weight to the results in favor of those lists.

The  next  step  involved  integrating  the  OpenLayers  library  for  displaying  map  data  into 
Haystack.  Scripts  were  developed  for  integrating  the  list  of  locations  into  a  tab-based  interface 
and  displaying  the  location  markers  on  the  map  with  links  to  Wikipedia  entries. 

Media Player: The Media Player section of Haystack has been subjected to constant updates.
jQuery opened up many new options in easing operation and customization for the user. A more
graceful  tab-based  system  was  created  across  the  top  of  the  page  to  allow  for  easy  access  to 
utterances,  translations  by  MT  engine,  speaker,  topic,  file  metadata,  and  geographical  functions. 
Also  accessible  through  the  tabs  are  a  link  to  the  auxiliary  file  data  used  in  the  pipeline  process 
for  Haystack  and  a  link  to  a  viewer  for  the  log  file. 

The  code  was  rewritten  to  update  the  dynamic  modal  windows  for  captioning.  This  update  allows 
for  smoother  opening  and  closing  of  windows  and  better  control  of  dragging  the  windows  for 
placement  or  resizing  with  corner  click-dragging.  By  fine  tuning  the  window  controls,  it  is  now 


25 Available  at:  http://www.geonames.org 
26 Available  at:  http://openlayers.org 
27 Available at: http://jquery.com

Figure  3:  Highlight  and  Scrolling  Functionality  Active  in  an  Utterance  Window 

standard that, when a play command is given from the search results, the window for the chosen
MT engine opens by default in the Player.

The viewer for processed text files has been updated so that all of the translated text is parallel
with  the  source  material  and  each  MT  engine  output  is  tab-based  for  viewing  one  at  a  time  or  all 
at  once. 

Highlighting: In Haystack, Solr is used for querying the vast index of processed media files.
Within the Media Player section for a specific file, a page-centric search function was
created that highlights every occurrence of the search term.

This within-page search allows a user to input a search term and see that term highlighted in each
available window. Each window, in turn, has its own controller for scrolling back and forth
between each highlighted term and can begin playing the file from that point. Figure 3 shows a
screenshot of the highlight and scrolling functionality active in an Utterance window.

HTML5: The initial file upload and media viewing capabilities of Haystack were Flash-based
applications. As HTML5 evolved and browsers began to adopt its functionality, research began
on how it might be leveraged to replace the Flash-based tools within the Haystack system. A
rudimentary player was developed using HTML5, with simple controls and limited captioning
options. To allow for cross-browser compatibility of the player, functionality was added to the
pipeline to convert audio and video files into OGG and MP4 formats.

A new File Upload system was developed in HTML5 that allows for thumbnail viewing and
playing and shows upload progress. The system allows multiple files to be selected, but only from a
standard file browser window, so it lacks the capabilities available through the HFL file option in
the original Multiple File Upload system. Research was conducted on making the File Upload
system more robust.

HTML Conversion: One missing capability in Haystack was the ability to upload a webpage, or
point to its address, for translation. A technique was developed using JavaScript and
the Document Object Model (DOM) to keep track of the position of the text and images so that
after translation they could be placed back in the correct format.

Figure 4: The Output from the HTML Conversion and Translation Tools

A  prototype  was  created  that  parsed  the  HTML  and  sent  only  the  text  on  for  translation.  The 
returned  text  was  integrated  back  into  the  HTML  framework  and  the  results  can  be  viewed  side- 
by-side.  For  the  prototype,  only  the  Systran7  MT  engine  was  used.  Figure  4  shows  a  screenshot 
of  the  HTML  conversion  application.  Development  is  continuing  to  make  the  HTML  uploader  a 
fully  functioning  part  of  the  Haystack  system. 
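
The idea of sending only the text for translation and reinserting it into the original markup can be sketched as follows. The prototype described above used JavaScript and the DOM; this sketch instead uses Python's standard `html.parser` module, and the class name, the `translate_html` helper, and the stand-in `translate` callback are all assumptions.

```python
from html.parser import HTMLParser


class TranslatingRewriter(HTMLParser):
    """Re-emit HTML unchanged except that text nodes pass through translate()."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            self.out.append(data)  # code or whitespace: leave untouched
        else:
            self.out.append(self.translate(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def translate_html(html, translate):
    """Return html with every visible text node replaced by translate(text)."""
    rewriter = TranslatingRewriter(translate)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)


if __name__ == "__main__":
    # str.upper stands in for a call to an MT engine.
    print(translate_html('<p class="x">Hello <b>world</b></p>', str.upper))
```

Because the tags are copied through verbatim, the translated page keeps the layout of the input, which is what allows a side-by-side view.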

OCR: The SCREAM Lab received a copy of the Raytheon/BBN Document Analysis Service
(DAS)  to  test  out  as  an  OCR  option  to  use  in  Haystack.  The  default  language  system  packaged 
with  it  was  Chinese. 

Research  began  on  creating  a  pipeline  to  tie  into  the  DAS  system  and  to  optimize  the  input  and 
output  for  best  translation.  After  considerable  testing,  an  image  resolution  of  400  dots  per  inch 
(DPI)  was  considered  optimal.  The  initial  phase  allows  for  an  image  to  be  uploaded  into  the 
Haystack system, but the second stage is prompted by a command line instruction to the DAS
system itself to commence the OCR operation and output the Extensible Markup Language
(XML)-based files back into the Haystack directory structure. A third phase is then initiated to
translate  the  extracted  text  and  place  it  back  into  the  framework  of  a  newly  created  viewer  in  the 
Media  Player  section. 

Continued development of the viewer for Chinese DAS OCR images added tabs for viewing the
extracted text, translated text, and scanned zones of the image. Clicking on a zone within the
main image scrolls to the corresponding translation/OCR segment. A zoom function was
developed so that scanned segments could be viewed at a greater magnification to check against
the OCR output. Figure 5 shows the prototype output of integrating OCR into Haystack.

Machine Translation: A pipeline was developed to integrate the Moses machine translation
server into Haystack. Systems were integrated for French, Spanish, German, Farsi, Pashto,
Arabic and to normalize, tokenize, and recase the text

Figure 5: The Prototype Output of Integrating OCR into Haystack

Using  the  same  procedure  for  calling  Systran5  via  CyberTrans  for  MT,  a  system  was  created  that 
brought  in  Gister  for  translating  in  17  languages  and  Motrans  for  Arabic  and  Portuguese. 

With  the  availability  of  Systran7,  integration  of  its  resources  into  the  MT  pipeline  was 
accomplished.  This  latest  version  changed  the  AJAX  parameters  used  by  Systran5,  so  a  new 
solution  was  developed.  Incorporating  Systran7  added  another  source  for  translation  of  Arabic 
and  Urdu. 

XTrans: XTrans28 is a transcription tool for transcribing and annotating audio
recordings. With the help of a script written by Mr. Eric Hansen to convert Haystack-specific
XML files to the tab-delimited files used by XTrans, functionality was added to the Haystack
Media Player that lets a user click through to obtain a command line instruction that starts
XTrans in a Linux terminal. Figure 6 shows an XTrans window running from the command line
instructions created in Haystack. Future development in this area will include the ability for
linguists to upload changes made to the transcription and to see the results of those changes within
the Haystack Media Player.

2.2.2.  Pipeline  Improvements 

Several  improvements  were  made  to  the  Haystack  processing  pipeline.  Major  additions  include  the 
following:  support  for  decoding  hybrid  DNN-HMM  speech  recognition  systems,  support  for  N- 
gram  and  RNN  LM  rescoring  in  the  ASR  pipeline,  and  improved  text  extraction  from  PDF  files. 
Hybrid  DNN-HMM  decoding  is  implemented  using  the  Theano  and  Sphinx-4  software  described 
in  Section  2.1.4.  LM  rescoring  is  applied  using  the  following  procedure.  First,  N-best  lists  are 
extracted  from  each  recognition  lattice  and  rescored  with  the  specified  N-gram  and  RNN  LMs. 

The  SRILM  toolkit  is  used  to  extract  the  N-best  lists  and  apply  N-gram  rescoring;  the  RNNLM 
toolkit  is  used  for  RNN  rescoring.  Next,  the  log  probabilities  from  each  model  are  linearly 


28 Available at: https://www.ldc.upenn.edu/language-resources/tools/xtrans

Figure 6: XTrans Window Running from the Command Line Instructions Created in Haystack

interpolated  as  described  by  Equation  5.  Finally,  the  maximum  scoring  hypothesis  is  selected  for 
each  utterance. 

Text  extraction  from  PDF  files  was  implemented  using  PDFMiner.  PDFMiner  provides  the 
location,  font  style,  and  font  size  of  each  character,  and  groups  sequences  of  characters  into  lines. 
Software  was  developed  to  reverse  the  character  ordering  of  right-to-left  text  and  automatically 
merge  lines  of  text  into  paragraphs.  Paragraph  boundaries  were  inserted  by  considering  the 
following  factors:  font  size,  text  direction,  vertical  spacing,  indents  on  the  first  line  of  a  paragraph, 
and  text  margins. 
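
A simplified version of the line-merging logic is sketched below. It is not the software described above: the `Line` record, the gap threshold, and the function name are hypothetical, and only three of the listed factors (font size, text direction, and vertical spacing) are used. Indents and margins are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float         # distance from the top of the page, in points
    font_size: float
    rtl: bool = False  # True for right-to-left text stored in visual order


def merge_lines(lines, max_gap_factor=1.5):
    """Group lines into paragraphs; start a new paragraph when the font size
    or direction changes, or when the vertical gap is unusually large."""
    paragraphs, current, prev = [], [], None
    for line in sorted(lines, key=lambda l: l.top):
        # Reverse the character order of right-to-left lines.
        text = line.text[::-1] if line.rtl else line.text
        new_paragraph = (
            prev is None
            or line.font_size != prev.font_size
            or line.rtl != prev.rtl
            or line.top - prev.top > max_gap_factor * prev.font_size
        )
        if new_paragraph and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    page = [
        Line("Title", 10, 18),
        Line("First line", 40, 10),
        Line("second line", 52, 10),
        Line("New paragraph", 80, 10),
    ]
    print(merge_lines(page))
```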

Minor improvements to the Haystack pipeline include the following. First, the code was modified
so that all documents are submitted for processing using Open Grid Scheduler (OGS). Second,
video conversion is performed in parallel with the rest of the processing pipeline. Third, the
conversion routine was updated so that SoX is used to modify the audio sample rate and normalize
the audio volume. Fourth, the video thumbnail extraction routine was updated to use automatic
scene detection and select the image with the highest entropy. Lastly, English text recasing is now
performed using scripts from the Moses distribution.31
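
The entropy criterion for thumbnail selection can be illustrated as follows. This sketch assumes that entropy means the Shannon entropy of the grayscale intensity histogram, which the report does not state; the function names and the representation of a frame as a flat list of pixel values are also assumptions.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy, in bits, of a sequence of grayscale pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pick_thumbnail(frames):
    """frames: list of (name, pixels) pairs, e.g. one candidate per detected
    scene. Returns the name of the frame with the highest entropy."""
    return max(frames, key=lambda frame: entropy(frame[1]))[0]


if __name__ == "__main__":
    candidates = [
        ("blank", [0, 0, 0, 0]),
        ("two_tone", [0, 255, 0, 255]),
        ("varied", [0, 64, 128, 255]),
    ]
    print(pick_thumbnail(candidates))
```

A uniformly colored frame has zero entropy, so this criterion tends to avoid blank or fade-to-black frames.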

29Available  at:  http://pypi.python.org/pypi/pdfminer 
30Available  at:  http://gridscheduler.sourceforge.net 
31Available at: http://www.statmt.org

2.2.3. ASR Systems

Japanese,  Chinese,  and  Pashto  ASR  systems  were  developed  for  Haystack.  HMMs  were  trained 
for  each  language  using  the  same  procedure  described  in  Section  2.1.1,  except  that  the  Chinese 
system  used  a  modified  feature  set;  trigram  LMs  were  estimated  using  the  SRILM  toolkit.  The 
remainder  of  this  section  describes  the  systems  in  more  detail  and  presents  recognition  results  for 
each  language. 

Japanese:  AMs  were  trained  on  20  hours  of  audio  from  GlobalPhone.  The  GlobalPhone 
transcripts  are  provided  with  spacing  between  words,  and  include  mappings  from  kanji  to 
katakana  and  hiragana.  Note  that  Japanese  text  is  usually  written  without  spacing  between  words, 
and  includes  a  combination  of  kanji,  hiragana,  and  katakana.  Kanji  are  Chinese  characters; 
hiragana  and  katakana  are  syllabaries.  A  pronunciation  dictionary  was  manually  created  using 
the katakana and hiragana transcripts with the Omniglot phoneme set [34]. The final HMM set
included 2000 shared states with an average of 16 mixtures per state.

An  interpolated  trigram  LM  was  estimated  on  the  GlobalPhone  transcripts  and  articles 
downloaded  from  Wikipedia.  The  JUMAN  morphological  analyzer  was  used  to  segment  the 
Wikipedia  text  into  words  and  convert  the  kanji  to  hiragana  and  katakana.  The  LM  was  trained 
on the text that included kanji, and the pronunciation dictionary was created using the hiragana
and  katakana.  The  LM  vocabulary  included  the  65000  most  common  words. 
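The report does not state how the interpolation weights were chosen. One common approach, sketched here as an assumption, is to estimate them with expectation-maximization (EM) so that the perplexity of held-out text is minimized. The per-word probabilities below are made up, and `estimate_weights` and `perplexity` are illustrative names.

```python
import math

def estimate_weights(streams, iters=50):
    """EM estimate of linear interpolation weights.

    streams[k][i] is the probability that component LM k assigns to
    held-out word i given its history.
    """
    k, n = len(streams), len(streams[0])
    weights = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for i in range(n):
            mix = sum(weights[j] * streams[j][i] for j in range(k))
            for j in range(k):
                # Posterior probability that component j generated word i.
                counts[j] += weights[j] * streams[j][i] / mix
        weights = [c / n for c in counts]
    return weights

def perplexity(streams, weights):
    """Perplexity of the interpolated model on the held-out words."""
    n = len(streams[0])
    log_sum = 0.0
    for i in range(n):
        log_sum += math.log(sum(w * s[i] for w, s in zip(weights, streams)))
    return math.exp(-log_sum / n)

# Made-up held-out probabilities: in-domain transcripts LM vs. Wikipedia LM.
streams = [
    [0.10, 0.20, 0.05, 0.10],
    [0.01, 0.02, 0.20, 0.01],
]
weights = estimate_weights(streams)
```

Because each EM iteration cannot decrease the held-out likelihood, the estimated weights give a perplexity no worse than the uniform starting point.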

This  system  yielded  a  21.1%  WER  on  the  GlobalPhone  development  partition.  For  comparison 
purposes,  the  AMs  were  evaluated  with  a  trigram  LM  estimated  on  GlobalPhone  only,  which 
yielded  a  25.5%  WER. 
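For reference, the sketch below shows how an error rate of this kind is conventionally computed: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. Applied to word lists it gives WER; applied to character lists it gives CER. The function names are illustrative, and the scoring tool actually used in this work is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent relative to the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def relative_reduction(baseline, new):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline - new) / baseline

wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
cer = error_rate(list("abc"), list("abd"))
gain = relative_reduction(25.5, 21.1)  # the two WERs reported above
```

With the two figures reported above, the absolute gain of 4.4 points corresponds to a relative WER reduction of roughly 17%.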

Chinese: AMs were trained on 175 hours of audio from the Global Autonomous Language
Exploitation (GALE) corpus. The GALE text was first segmented into words using the Linguistic
Data Consortium (LDC) Chinese word segmenter. A pronunciation dictionary was created by
mapping the Chinese characters to pinyin (see footnote 34) and splitting the pinyin into a
95-phoneme set that includes tone markings. Pronunciations for English words were obtained by
mapping phonemes from the English CMU pronunciation dictionary to the Chinese phoneme set and
training a Sequitur grapheme-to-phoneme system [35].
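The pinyin-splitting step can be sketched as follows. The 95-phoneme inventory itself is not listed in this section, so the code only illustrates the general idea under an assumption: each tone-numbered syllable is split into an initial and a final, and the tone digit stays attached to the final. The initial list and function names are illustrative.

```python
# Standard pinyin initials; two-letter initials come first so "zh" wins over "z".
# Treating "y" and "w" as initials is a simplifying assumption.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split e.g. 'zhong1' into ['zh', 'ong1']; syllables without an initial stay whole."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            final = syllable[len(initial):]
            # Require a non-empty final besides the tone digit.
            if final.rstrip("12345"):
                return [initial, final]
    return [syllable]

def pronounce(syllables):
    """Concatenate the phoneme sequences of a word's syllables."""
    phones = []
    for syllable in syllables:
        phones.extend(split_pinyin(syllable))
    return phones

phones = pronounce(["zhong1", "guo2"])
```

Keeping the tone on the final multiplies the number of vowel-like units by the number of tones, which is one way a pinyin-derived inventory can grow to the order of a hundred phonemes.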

The HMM set included 4000 shared states with an average of 28 mixtures per state. The feature
set consisted of 12 PLPs, plus the zeroth coefficient, with mean normalization applied on a
per-utterance basis. A pitch feature was extracted using the Entropic Signal Processing System
(ESPS) method implemented in the Snack toolkit; pitch values over unvoiced segments were
defined using the method of [36]. Delta and acceleration coefficients were appended to form a
42-dimensional feature vector (14 static features, each with a delta and an acceleration
coefficient).
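The dimensionality can be checked with a small sketch: 12 PLPs plus the zeroth coefficient give 13 values, the pitch feature makes 14, and appending delta and acceleration coefficients triples this to 42. The code assumes HTK-style regression deltas with a window of two frames and applies mean normalization to the PLP coefficients only; both choices, and all function names, are assumptions rather than details taken from the report.

```python
def mean_normalize(frames):
    """Subtract the per-utterance mean from every coefficient."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

def deltas(frames, window=2):
    """Regression-based time derivatives with edge frames replicated."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[0])):
            num = sum(k * (frames[min(t + k, n - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, window + 1))
            row.append(num / denom)
        out.append(row)
    return out

def build_features(plp, pitch):
    """plp: T x 13 list (12 PLPs + c0); pitch: list of T values. Returns T x 42."""
    static = [c + [p] for c, p in zip(mean_normalize(plp), pitch)]  # 14 per frame
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy utterance: 5 frames whose coefficients increase linearly over time.
plp = [[float(t + d) for d in range(13)] for t in range(5)]
pitch = [100.0 + t for t in range(5)]
feats = build_features(plp, pitch)
```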

An  interpolated  trigram  LM  was  estimated  on  GALE,  the  fifth  edition  of  the  Chinese  Gigaword 
corpus  [37],  and  broadcast  news  transcripts  from  HUB4-NE  [38].  The  text  was  segmented  into 


32 Available at: http://dumps.wikimedia.org/jawiki
33 Available at: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
34 The Unicode to pinyin mappings used in this work are available at: http://www.ic.unicamp.br/~stolfi/voynich/Notes/061/uc-to-py.tbl
35 Available at: http://www.speech.kth.se/snack
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words  using  the  LDC  Chinese  word  segmenter,  and  the  final  vocabulary  included  53100  words. 
This system yielded an 11.2% CER on the HUB4-NE test partition. For comparison purposes, a
previously  developed  ASR  system  yielded  a  14.4%  CER  on  the  same  test  set;  that  system  was 
trained  on  HUB4-NE  acoustic  and  text  data,  plus  the  fourth  edition  of  the  Chinese  Gigaword 
corpus. 

Pashto:  AMs  were  trained  on  43  hours  of  audio  from  the  Appen  Broadcast  News  (BRC001) 
corpus  and  104  hours  from  the  TRANSTAC  corpus.  The  speech  segments  from  BRC001  were 
automatically  clustered  using  the  MIT-LL  GMM  software  package.  Pronunciations  for  all  words 
were  derived  using  the  TRANSTAC  dictionary  and  a  Sequitur  grapheme-to-phoneme  system. 
Note  that  all  diacritics  were  removed  from  the  dictionary  prior  to  training  the  Sequitur  models 
since  diacritics  are  not  included  in  the  transcripts.  The  final  HMM  set  included  4000  shared  states 
with  an  average  of  24  mixtures  per  state. 
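A minimal sketch of the diacritic-removal step is shown below. It assumes that the diacritics are encoded as Unicode combining marks (general category Mn), which holds for the Arabic-script short-vowel signs; the lexicon layout, the phoneme symbols, and the function names are hypothetical and not taken from the TRANSTAC dictionary.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks (Unicode category Mn) from a word."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

def strip_lexicon(lexicon):
    """Strip diacritics from lexicon keys, merging pronunciations that collide.

    lexicon: dict mapping a word to a list of pronunciations,
    each pronunciation being a list of phoneme symbols.
    """
    stripped = {}
    for word, prons in lexicon.items():
        bucket = stripped.setdefault(strip_diacritics(word), [])
        for pron in prons:
            if pron not in bucket:
                bucket.append(pron)
    return stripped

# Two vocalized spellings that share the same undiacritized form (illustrative).
lexicon = {
    "\u0643\u064e\u062a\u064e\u0628": [["k", "a", "t", "a", "b"]],
    "\u0643\u064f\u062a\u064f\u0628": [["k", "u", "t", "u", "b"]],
}
plain = strip_lexicon(lexicon)
```

After stripping, one written form can carry several pronunciations, so the grapheme-to-phoneme model has to learn vowels that are no longer visible in the spelling.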

An interpolated trigram LM was estimated on the training transcripts using the full 26157 word
vocabulary. This system yielded a 22.1% WER on the BRC001 development partition. A second
LM was estimated using additional text data from Sada-e Azadi and Wikipedia; however, this
LM did not yield an improvement in system performance.


36 Available at: http://www.sada-e-azadi.net
37 Available at: http://dumps.wikimedia.org/pswiki
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CONCLUSIONS 


In conclusion, work has been accomplished in the areas of ASR and information extraction,
especially in the context of the Haystack MMIER system.

For ASR, Korean systems were developed using both words and sub-word units that can be
combined to form words; this was done in an effort to reduce the effects of OOV words
encountered by the recognizer. Using sub-word units yielded a small improvement; however, the
final WERs are still high compared to similar systems developed on other languages. Levantine
Arabic and Farsi ASR systems were trained on conversational telephone speech. All systems
yielded WERs above 50%, which is not entirely unexpected since these systems were trained on
relatively small corpora and used GMM-based AMs. Three methods were investigated for
combining Pashto speech recognition systems: ROVER, N-best ROVER, and word posterior
decoding using DDA matching scores. All methods yielded better performance than any single
ASR system, and the best WER was obtained using N-best ROVER. Software was developed for
training and evaluating hybrid DNN-HMM speech recognition systems, which generally yield
better performance than GMM-HMMs. This software was used to train an English ASR system
for the IWSLT 2013 evaluation, which placed third out of the eight systems that were submitted
for the evaluation. Several methods were investigated for interpolating probabilities from 4-gram
and RNN LMs; interpolating log probabilities yielded the best overall WER. Finally, English and
Italian ASR systems were developed for the IWSLT 2014 evaluation. The English system placed
third out of the eight ASR systems that were submitted for the evaluation, and the Italian system
placed second out of four.

Over this period, Haystack gained substantial functionality and its user interface evolved, with
a focus on making a large amount of information easily available through multi-file upload, a
Media Player with various new options, and additional MT capabilities. The toolset expanded
with research into geolocation, testing of OCR for media translation, the ability to upload and
translate webpages, and the adaptation of Haystack to use HTML5 for media playback and file
upload. Major additions to the processing pipeline include support for decoding hybrid
DNN-HMM systems, support for N-gram and RNN LM rescoring, and improved text extraction
from PDF files. Japanese, Chinese, and Pashto ASR systems were developed and then
incorporated into Haystack.
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LIST  OF  ACRONYMS  &  GLOSSARY 


AJAX          Asynchronous JavaScript and Extensible Markup Language
AM            Acoustic Model
ARB-CTS       Levantine Arabic conversational telephone speech corpus released by the
              Linguistic Data Consortium
ARB-QT        Levantine Arabic conversational telephone speech corpus released by the
              Linguistic Data Consortium
ASR           Automatic Speech Recognition
ASR001        Farsi telephone prompt speech corpus released by Appen
ASR002        Farsi conversational telephone speech corpus released by Appen
BRC001        Pashto broadcast news corpus of text and audio released by Appen
CER           Character Error Rate
CMLLR         Constrained Maximum Likelihood Linear Regression
CMU           Carnegie Mellon University
CyberTrans    Machine translation system developed by the U.S. government
DAS           Document Analysis Service
DDA           Driven Decoding Algorithm
DNN           Deep Neural Network
DOM           Document Object Model
DP            Dynamic Programming
DPI           Dots Per Inch
ESPS          Entropic Signal Processing System
Euronews      Multilingual broadcast news corpus of text and audio
FFmpeg        Cross-platform software for recording, converting, and streaming audio and
              video
Fisher        An English conversational telephone speech corpus released by the Linguistic
              Data Consortium
FLARe         Foreign Language Analysis and Recognition
GALE          Global Autonomous Language Exploitation
GeoNames      A creative commons database with over 10 million geographical names
              integrated with geographical data such as population, elevations, and
              latitude/longitude coordinates
Gister        Machine translation system developed by the U.S. government
GlobalPhone   Multilingual speech and text database
GMM           Gaussian Mixture Model
GPU           Graphics Processing Unit
Haystack      An internal lab project to integrate various capabilities into a system to index,
              analyze, translate, store, and retrieve multilingual information from rich
              multimedia documents in various languages
HDecode       Cambridge University large vocabulary continuous speech recognizer
HFL           Haystack File List
HLDA          Heteroscedastic Linear Discriminant Analysis
HMM           Hidden Markov Model
HTK           Cambridge University Hidden Markov Model Toolkit
HTML          Hypertext Markup Language
HUB4          An English broadcast news corpus of text and audio released by the
              Linguistic Data Consortium
HUB4-NE       A non-English broadcast news corpus of text and audio released by the
              Linguistic Data Consortium
ICSI          International Computer Science Institute
IPA           International Phonetic Alphabet
IWSLT         International Workshop on Spoken Language Translation
JavaScript    A script language typically used to enable programmatic access to
              computational objects within a host environment, commonly a web browser
JQuery        An open source JavaScript library for dynamic update and control of web pages
JUMAN         A user-extensible morphological analyzer for Japanese developed at Kyoto
              University
kHz           Kilohertz
LDC           Linguistic Data Consortium
LM            Language Model
MIT-LL        Massachusetts Institute of Technology Lincoln Laboratory
MMIER         Multilingual Multimedia Information Extraction and Retrieval
Morfessor     Software developed at Helsinki University of Technology for unsupervised
              learning of morphology
Moses         A statistical machine translation system
MPE           Minimum Phone Error
MT            Machine Translation
NIST          National Institute of Standards and Technology
OCR           Optical Character Recognition
OGS           Open Grid Scheduler
OOV           Out-of-Vocabulary
PDF           Portable Document Format
PDFMiner      A tool for extracting information from portable document format files
PLP           Perceptual Linear Prediction
Python        High level programming language
QuickNet      Software developed at the International Computer Science Institute for training
              and evaluating multi-layer perceptrons
RNN           Recurrent Neural Network
ROVER         Recognizer Output Voting Error Reduction
SAD           Speech Activity Detector
SAT           Speaker Adaptive Training
SCREAM        Speech and Communication Research, Engineering, Analysis, and
              Modeling
SoX           Sound Exchange Toolkit
Sphinx-4      Carnegie Mellon University large vocabulary continuous speech recognizer
SRILM         A language modeling toolkit developed at Stanford Research Institute
Systra        Commercial machine translation system
TED           Technology, Entertainment, and Design
TEDx          An independently organized TED-like event
Theano        Numerical computational library for Python that can be compiled to run on a
              graphics processing unit
TRANSTAC      Translation System for Tactical Use
WER           Word Error Rate
WMT           Association for Computational Linguistics Workshop on Machine
              Translation
XML           Extensible Markup Language
XTrans        Transcription tool developed by the Linguistic Data Consortium
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